Ocean Conservancy has taken the initiative to organize a global network of cities to combat marine pollution through the Urban Oceans program. This program takes multiple approaches to address ocean plastic pollution. First, they mobilizes cleanup campaigns across each city based on volunteer efforts with international and local support groups, such as the International Coastal Cleanup. Second, they research each city’s sanitation and urban systems to identify key details regarding what causes trash to end up in oceans & waterways, sharing findings and recommending action plans to local communities and decision makers. And third, working with stakeholders directly to create changes in products, practices, and behaviors as alternative solutions to their current needs. The program has been successful in engaging local communities, businesses, and governments to take action against marine pollution.
With these efforts, each campaign requires an evaluation of local context and detailed understanding of the city’s systems and dynamic. This aspect of the process requires long-term planning and methodical decision-making to understand the pre-existing conditions in place. Hence, our model aims to streamline the process of understanding the general structure of each city-context and provide a comprehensive solution on where efforts and resources should be allocated to. Ultimately saving time on where to implement their waste-reduction strategies.
We based our focus on various cities from the Resilient Cities Network and Urban Ocean program. The Resilient Cities Network is a coalition of cities around the world that assist one another in building safer, more sustainable, and equitable futures for their populations. Each of which shares their knowledge, experiences, and policy practices to empower one another. A subsection of the Resilient Cities Network is the Urban Ocean program, a partnership with Ocean Conservancy tackling the ocean plastic problem. The city displayed as the example for our model is Chennai, India. The city is apart of the Urban Oceans 12 city network with the goal of proactively enhancing their sanitation infrastructure to combat plastic pollution in their region.
The 12 participating cities apart of the Urban Ocean program as of May 2024 are:
Pollution is a global problem that takes many forms contaminating every aspect of the environment. Some of this pollution, specifically plastic-waste, ends up in our oceans, harming marine life, contaminating drinking water, and hurting local economies. In response, Ocean Conservancy’s Urban Ocean program has been developing projects that mitigate marine pollution, assess waste management, and enable cities to address ocean plastics and resilience. These projects intend to deploy “zero-waste” pilot strategies in cities around the world. With this wide-spanning outreach and diversity in partnerships, an assessment on site selection and resource allocation becomes a repeated step in the process that hinders efficiency each waste-reduction campaign.
Our objective is to develop a site assessment model that identifies effective zero-waste site locations for national and multinational use. This geospatial risk assessment model serves the purpose of predicting litter accumulation based on globally sourced data with a repeatable framework on an international scale. The results of the model aim to evaluate which sections of a given area have the highest likelihood to produce and contain litter relative to its surroundings. For our case, the Urban Ocean program intends to dedicate “zero-waste” solutions in these areas for the most effective impact for each deployment.
Data Sources: Marine Debris Tracker, OpenStreetMap, Google My Maps
Research & Partnership:Ocean Conservancy/Urban Ocean, Resilient Cities Network,
In terms of our data & variables, our primary environmental health indicator, marine debris/litter, was acquired via manual download from Marine Debris Tracker (https://www.debristracker.org/data/). The Marine Debris Tracker is a web-application & public-use database that tracks all recorded global litter disposal efforts from participating organizations. The data for each city was carved out by a boundary box determined by initial-research political boundaries. From there, all debris recorded between January 2021 - February 2024 was compiled to a CSV and used as our primary independent variable. Each selected city required its own evaluation in terms of usable debris data over the several year time span. Ideal locations featured several thousand data points across 2-3 years for early & late stage model development. Some cities included as few as ~500 data points. Despite the repeated process of the model itself, results and accuracy laid contingent on data quantity and external factors available to each region.
OpenStreetMap (OSM) data was used to provide additional context to the litter data. OSM featured land-use, proximity to water, waste site facilities, roads, and other indicators of commercial activity. The aim of these variables was to act as proxies for litter cases, providing insight on where litter most likely accumulated or ends up. Each variable was then placed on the fishnet grid with the litter data, computed using a Chi-Squared test, and evaluated for association between the variables and litter.
For our model, we directed our focus on data modeling from the Urban Ocean cities. To develop our model for each respective city (or future use), each area was determined by a ‘boundary box’ that was created by a custom KML file using Google My Maps (https://www.google.com/mymaps) or sourced online. Each boundary’s drawing was dictated by political boundaries and practical scope of where data was recorded within the city, further methodology of each individual city is detailed on the interactive dashboard.
After the boundary box was created, the data was then projected to a fishnet grid. The fishnet grid was used to count the number of litter points in each cell in a format that can be compared to by each grid cell in the city.
Following data extrapolation, the litter files were overlaid on a 500x500 meter fishnet grid to create a uniformed data structure for each city. The fishnet grid was created using the ‘sf’ package and projected to the respective city’s UTM projection. The grid was used to aggregate and visualize the litter data and other variables to create a uniformed data structure for each city. The variables used in the model were selected based on their potential association with litter accumulation. These variables included land-use, proximity to water, waste site facilities, roads, and other indicators of commercial activity. The aim of these variables was to act as proxies for litter cases, providing insight on where litter most likely accumulated or ends up. Overall the goal of the data gathering and variables chosen provide a model an understanding the city itself, and in turn understand the systems in place and the environment assessment needed.
Each aspect of the data gathering, through manual download, creation, or API, can now be coalesced into one cohesive visual. As mentioned with the 500x500 meter grid, all variables including litter as measuring based on an attribute regarding the cells. This grid is structure by the boundary box created from Google My Maps. This boundary file projects the city and defines its boundary on where the model is evaluating. Boundary files for our data are knowingly compatible with KML or GEOJSON files, or other file types that can be converted to raster layers.
For example, our litter data is not counted as individual data points for model development, but rather a sum of points within each cell. This allows for a better indication of frequency and density of litter in each area. The same method is applied to our dependent variables as well for uniformed formatted and consistency in structure. Below code & figure 3 demonstrate the example process of importing the litter CSV, boundary KML, and the creation of the overlay used for data exploration and analysis.
knitr::include_graphics("https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/pictures/FishnetProcessSlide.png")#step to load city's data
litter <- read.csv('https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/mdt-dataChennai.csv')
# data filter and projection transformation
litter_p <- litter%>%filter(master_material == 'PLASTIC')%>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326, agr = "constant")%>%st_transform('EPSG:32643')
#img <- raster("/Users/mr.smile/Desktop/UPENN/Spring24/CPLN790/data/population_ind_pak_general/population_10_lon_80_general-v1.5.tif")
chen_bdry <- st_read('https://github.com/TrevorKap/MUSA810-Marine-Pollution/raw/main/Data/Chennai.kml')
chen_bdry <- st_set_crs(chen_bdry, 4326)%>%st_transform('EPSG:32643')
temp_bd <- st_read('https://github.com/TrevorKap/MUSA810-Marine-Pollution/raw/main/Data/Chennai.kml')
temp_bbox <- get_bbox(temp_bd) # get the bounding box (the projection of temp_bd should be epsg4326)
temp_fish <- create_fish(chen_bdry) # get the fishnet of the city (the projection of chen_bdry should be meter degree)
final_net <- countfishnet(temp_fish, litter_p) # create base fishnet with litter (also the one used as final one)
final_net <- pn_gen(stor_df) # add osm point data and knn calculation result into the final dataset
#temp_point <- raster_process(img,temp_bd) # convert the raster file to point one
#pop_result <- pop_process(temp_point, temp_fish, 32643) # summary the population result
#final_net <- add_pop(pop_result,final_net) # add the pop result into the final dataset
final_net <- moran_gen(final_net,stor_df) # calculate the moran's I result into the dataset
# DONE!
chen_net <- final_netcity_data_list <- lapply(cities, load_city_data)
names(city_data_list) <- cities
bd_data_list <- lapply(cities, load_city_kml)
names(bd_data_list) <- bd
bd_data_meter_list <- lapply(cities, load_city_kml_meter)
names(bd_data_meter_list) <- bdM
for (i in seq_along(city_data_list)) {assign(cities[i], city_data_list[[i]])}
for (i in seq_along(bd_data_list)) {assign(bd[i], bd_data_list[[i]])}
for (i in seq_along(bd_data_meter_list)) {assign(bdM[i], bd_data_meter_list[[i]])}base_path <- "https://raw.githubusercontent.com/TrevorKap/MUSA810-Marine-Pollution/main/Data/stored_city/"
city_names <- c("santa_fe", "semarang", "panama", "can_tho", "melaka", "salvador",
"surat", "santiago", "bangkok", "chennai", "mumbai", "pune")
city_data <- list()
for (city in city_names) {
file_path <- paste0(base_path, city, "_net.geojson")
net_sf <- st_read(file_path)
net_sf <- st_set_crs(net_sf, 32643)
city_data[[city]] <- net_sf
}
list2env(setNames(city_data, paste0("net_", city_names)), envir = .GlobalEnv)cnt_wt <- function(litter,net){
temp <- st_join(litter,net,join = st_within)
temp <- temp %>% filter(!is.na(uniqueID))
temp_sum <- temp %>%
group_by(uniqueID)%>%
summarise(count_un = n_distinct(username))
net <- left_join(net,st_drop_geometry(temp_sum),by = 'uniqueID')
net <- net %>%
mutate(count_un = replace_na(count_un, 0),
count = count/count_un,
count = replace_na(count, 0))
return(net)
}
net_mumbai <- cnt_wt(Mumbai,net_mumbai)
net_chennai <- cnt_wt(Chennai,net_chennai)
net_bangkok <- cnt_wt(Bangkok,net_bangkok)
net_can_tho <- cnt_wt(Can_Tho,net_can_tho)
net_melaka <- cnt_wt(Melaka,net_melaka)
net_panama <- cnt_wt(Panama_City,net_panama)
net_pune <- cnt_wt(Pune,net_pune)
net_salvador <- cnt_wt(Salvador,net_salvador)
net_santa_fe <- cnt_wt(Santa_Fe,net_santa_fe)
net_santiago <- cnt_wt(Santiago,net_santiago)
net_semarang <- cnt_wt(Semarang,net_semarang)
net_surat <- cnt_wt(Surat,net_surat)z_score_normalize <- function(x) {
(x - mean(x)) / sd(x)
}
normal_city <- function(data){
data <- data%>%
mutate(across(where(is.numeric) & -'count', z_score_normalize),
across(where(is.numeric), ~replace(., is.nan(.), 0)))
return(data)
}
net_salvador <- normal_city(net_salvador)
net_santa_fe <- normal_city(net_santa_fe)
net_santiago <- normal_city(net_santiago)
net_semarang <- normal_city(net_semarang)
net_surat <- normal_city(net_surat)
net_bangkok <- normal_city(net_bangkok)
net_can_tho <- normal_city(net_can_tho)
net_chennai <- normal_city(net_chennai)
net_melaka <- normal_city(net_melaka)
net_mumbai <- normal_city(net_mumbai)
net_panama <- normal_city(net_panama)
net_pune <- normal_city(net_pune)net_bangkok <- net_bangkok %>% mutate(city = 'Bangkok',country = 'Thailand')
net_can_tho <- net_can_tho %>% mutate(city = 'Can_Tho',country = 'Vietnam')
net_chennai <- net_chennai %>% mutate(city = 'Chennai',country = 'India')
net_melaka <- net_melaka %>% mutate(city = 'Melaka',country = 'Malaysia')
net_mumbai <- net_mumbai %>% mutate(city = 'Mumbai',country = 'India')
net_panama <- net_panama %>% mutate(city = 'Panama_City',country = 'Panama')
net_pune <- net_pune %>% mutate(city = 'Pune',country = 'India')
net_salvador <- net_salvador %>% mutate(city = 'Salvador',country = 'Brazil')
net_santa_fe <- net_santa_fe %>% mutate(city = 'Santa_Fe',country = 'Argentina')
net_santiago <- net_santiago %>% mutate(city = 'Santiago',country = 'Chile')
net_semarang <- net_semarang %>% mutate(city = 'Semarang',country = 'Indonesia')
net_surat <- net_surat %>% mutate(city = 'Surat',country = 'India')
net_total <- rbind(net_bangkok,net_can_tho,net_chennai,net_melaka,net_mumbai,net_panama,net_pune,net_salvador,net_santa_fe,net_santiago,net_semarang,net_surat) %>% mutate(uniqueID = 1:n())
net_total_li <- net_total %>% dplyr::filter(count != 0)
net_total_li <- net_total_li %>%
mutate(count = (count - mean(count)) / sd(count))The litter data features a wide variety of item categories and general information about each piece. Despite the quality of detail for each recorded sample, this only accounts for litter that has been identified, recorded, and disposed of. This data does not account for litter that was identified but never disposed of, accounted for, or assumed. Regardless, areas where litter was not recorded in a general surrounding could not confidently be assumed to be present or not present. Hence, the purpose of the model is to predict where litter is most likely to accumulate within the city based on the data available.
Figure 2 shown below depicts the dependent variables used for model development. Each serve their own purpose in characterizing the city that is evaluated. Social & economic activities are depicted using restaurants and population density. Restaurants are an indication of commercial corridors and general ‘downtown’ aesthetic that attracts large populations, typically located in central hubs of urban activity. Their presence, especially when in clusters, can indicate a higher likelihood of commercial hubs and in turn program the model to classify these areas as major activity. Another metric for characterizing a city landscape is population density. Regardless of population itself, density is a key factor in determining where people are located in relation to the rest of the city. This metric supports our geospatial assessment in determining relative risk of litter accumulation, rather than independent probability.
Geomorphology describes the city’s physical geography for the model. Trash, when moved around by nature, weather, human activity, can a higher probability of moving towards lower elevations and bodies of water. Using water distribution as a factor accomplishes two pieces of information in one variable. Naturally, water distribution displays where water is located. When trash makes contact with a body of water, it rarely exists, requiring cleanup intervention to remove it, acting as an indication of higher risk. Secondly, it acts as a general but not definite idea of terrain. Proximity to water does not fully account or indicate for elevation, but it does provide a general idea of a path of travel for loose trash naturally moving.
Waste management or waste facilities serve the purpose off collecting, storing, transporting, recycling, and disposing of waste in an area to prevent or minimize adverse effects on the environment and public health of an area. These facilities act as a factor of litter risk because of their location in relation to the city center, where litter has been located, and their capabilities to cover their assign jurisdiction. Their critical role in the sanitation infrastructure and system in a city is a key factor in understanding any operation weaknesses.
Landuse is a general indicator of the city’s zoning and urban planning. Industrial zones, for example, are likely to have litter accumulation due to the nature of the activities. Residential zones, on the other hand, are less likely to have litter accumulation due to the opposite nature. The model uses landuse to potentially discover patterns in the city that may not necessarily be identified by intuition alone, but is still important is creating the picture of how an area is structured. With this generalization in mind, zoning alone is not enough for describing landuse. Roads also are a vaiable incorporated into the model to frame the network of the city, also acting as a proxy to density. Grid streets or segments closer together often represent denser, potentially taller buildings that represent activity, as well as vise-versa. Hence, each variable is a general feature known to be present in every city and allows for replicated use to any place in the world.
In addition to the dependent variables in their counted presence, we implemented a K-Nearest Neighbor analysis (also displayed) to determine the proximity of each data point to one another and incorporate that as an additional variable to our model. K-Nearest Neighbor (KNN) allows us to find the proximity of another variant to determine commercial/high activity spots in the area. The nearest neighbor factor was repeated for land-use, roads, restaurants, water, and waste facilities. We added KNN as a factor to place emphasis on density and location-based processing. This addition acts as a fail-safe for our model’s cell calculation given that examining count alone does not incorporate the real-world application of proximity.
Local Moran’s I measures spatial autocorrelation which measures the degree of spatial relationship between observations, measuring how closer values are clustered in a space. We measure these among the dependent and independent variables to factor in any spatial lags identified within the region. Similar to KNN in terms of purpose, this factor allows the model to understand the relationship between the dependent and independent variables and apply that relationship in its calculations for each cell.
The visualized examples (Figure 3) display what the model is interpreting when making interpreting the independent variables. The examples are, waste management facilities (waste), water (streams, ponds, canals, etc), restaurants, and zoning (industrial, residential, retail). The overall selection was based on our hypothesis of areas of human activity leading to higher litter risk. The aim of these variables is to act as proxies for litter cases, providing an indication on where litter most likely accumulated or ends up. Each variable is then placed on a combined fishnet grid with the litter data, and evaluated for association between the variables and litter.
The visuals showing ‘count’ measure the number of parcels with the zone or number of buildings within the cells and does not account for density. The nearest neighbor maps (‘Variable’_nn) measured their proximity on a gradient scale, with the darker colors indicating closer proximity.
map_waste <- visual_count(net_chennai, "waste")
map_waste_nn <- visual_count(net_chennai,'waste_nn')
grid.arrange(map_waste, map_waste_nn, ncol = 2)map_water <- visual_count(net_chennai,"water")
map_restaurant <- visual_count(net_chennai,"restaurant")
map_road <- visual_count(net_chennai,'road')
map_industrial <- visual_count(net_chennai,'industrial')
map_residential <- visual_count(net_chennai,"residential")
map_retail <- visual_count(net_chennai,'retail')
grid.arrange(map_water, map_restaurant, ncol = 2)map_water_nn <- visual_count(net_chennai,'water_nn')
map_restaurant_nn <- visual_count(net_chennai,"restaurant_nn")
map_road_nn <- visual_count(net_chennai,'road_nn')
map_industrial_nn <- visual_count(net_chennai,'industrial_nn')
map_residential_nn <- visual_count(net_chennai,"residential_nn")
map_retail_nn <- visual_count(net_chennai,'retail_nn')
grid.arrange(map_water_nn, map_restaurant_nn, ncol = 2)The Principal Component Analysis (PCA) is a dimensional reduction technique used to transform high-dimensional data into a lower-dimensional space while preserving as much variance as possible. The PCA below is used to identify the most important variables in the dataset and their correlation with litter, ultimately creating synthetic variables that can augment the data input for the model.
The first output depicts the general metrics found in the independent variables, such as standard deviation, variance, and the proportion of variance explained by each component.
The Eigenvalues from the PCA are shown in the second graph. The Eigenvalues are the amount of variance explained by each component, quantifying the importance of each component in the dataset. The first dimension explains the vast majority of the variance, with a sharp decrease thereafter. This means that we only need 1 dimension of overlap when evaluating our model and additional dimensions of evaluation will not be useful thereafter.
The PCA Variables graph shows the correlation of each variable in comparison to the first two dimensions. The arrows that represent each variable have varying directions and lengths, indicating mostly little correlation between the first dimension and second dimension. But the few variables that have relatively longer lengths have stronger relationships between both dimensions and can be used for the model. To better understand their importance we evaluate their Squared Cosine.
The squared cosine value (Cos2) in the third graph shows the quality of the representation of the variables on the factor map. The closer the value is to 1, the better the representation of the variable. What we notice particularly about these variables with the largest quality of representation is that they are statistical factor rather than a numeric value. Almost each independent variable with a ~0.75 value is either a KNN or significance value. The only exception being sum_pop (population density) and avg_pop (average population).
Similar to the second graph, highest correlation among the dependent and independent variables can be explained by the color and direction in the last graph. The last PCA graph is the basis for choosing the several variables. Based on the result, the selected variables included ‘waste_sig_dis, restaurant_sig_dis, residential_sig_dis, water_sig_dis, residential_nn, industrial_sig_dis, industrial_sig, restaurant_sig, industrial_nn, road_sig_dis, residential_sig, restaurant_sig, restaurant’ as the ‘shortened model’ independent variables. Each variable with a ’_sig’ suffix indicates where the presence of the variable, considering KNN and Moran’s I, have significant impact on where the dependent variable, litter, is located.
corr <- st_drop_geometry(net_total) %>% dplyr::select(!c(uniqueID,cvID,city,country))
corr_nor<- scale(corr)
corr_matrix <- cor(corr_nor)
#ggcorrplot(corr_matrix)
data.pca <- princomp(corr_matrix)
summary(data.pca)## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.1544504 0.36523214 0.34046877 0.31029899 0.25846779
## Proportion of Variance 0.6144093 0.06149576 0.05343942 0.04438824 0.03079783
## Cumulative Proportion 0.6144093 0.67590503 0.72934445 0.77373269 0.80453052
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.23518531 0.21725529 0.20707302 0.1916407 0.18403398
## Proportion of Variance 0.02549926 0.02175945 0.01976761 0.0169310 0.01561361
## Cumulative Proportion 0.83002978 0.85178923 0.87155684 0.8884878 0.90410144
## Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
## Standard deviation 0.17324598 0.15353018 0.143181926 0.1332383 0.120928775
## Proportion of Variance 0.01383673 0.01086663 0.009451127 0.0081840 0.006741655
## Cumulative Proportion 0.91793817 0.92880480 0.938255928 0.9464399 0.953181583
## Comp.16 Comp.17 Comp.18 Comp.19
## Standard deviation 0.120295484 0.102914317 0.101110151 0.100044032
## Proportion of Variance 0.006671229 0.004882686 0.004712992 0.004614127
## Cumulative Proportion 0.959852812 0.964735498 0.969448490 0.974062617
## Comp.20 Comp.21 Comp.22 Comp.23
## Standard deviation 0.097015031 0.09618051 0.092972876 0.088041257
## Proportion of Variance 0.004338956 0.00426463 0.003984921 0.003573384
## Cumulative Proportion 0.978401573 0.98266620 0.986651124 0.990224508
## Comp.24 Comp.25 Comp.26 Comp.27
## Standard deviation 0.080715923 0.068107836 0.058459405 0.052915014
## Proportion of Variance 0.003003486 0.002138461 0.001575491 0.001290818
## Cumulative Proportion 0.993227994 0.995366455 0.996941947 0.998232765
## Comp.28 Comp.29 Comp.30 Comp.31
## Standard deviation 0.0409395731 0.035875430 0.0252563016 0.0152463435
## Proportion of Variance 0.0007726696 0.000593337 0.0002940673 0.0001071615
## Cumulative Proportion 0.9990054343 0.999598771 0.9998928385 1.0000000000
## Comp.32
## Standard deviation 2.228071e-09
## Proportion of Variance 2.288576e-18
## Cumulative Proportion 1.000000e+00
## Comp.1 Comp.2
## count 0.03910654 0.02824418
## water 0.06939559 -0.16334301
## water_nn -0.25294001 0.17990491
## waste 0.06006527 -0.06645713
## waste_nn -0.28343700 0.02087865
## restaurant 0.12232184 -0.04222381
## restaurant_nn -0.27714032 -0.08945392
## road 0.16341393 0.41090861
## road_nn -0.13750912 -0.31917998
## industrial 0.04557497 -0.23972791
## industrial_nn -0.12543259 0.22518605
## residential 0.12280232 -0.03065005
## residential_nn -0.16985709 -0.10984955
## retail 0.08343909 -0.10307191
## retail_nn -0.27734971 0.02874484
## avg_pop 0.23916125 0.24137571
## sum_pop 0.25998533 0.29686304
## water_sig 0.09268665 -0.18719624
## water_sig_dis -0.26523016 0.18214397
## waste_sig 0.10033340 -0.10307020
## waste_sig_dis -0.25677004 0.09727076
## restaurant_sig 0.13120168 -0.05896123
## restaurant_sig_dis -0.28062883 -0.01118523
## road_sig 0.05390253 0.23012861
## road_sig_dis -0.09843262 -0.27218982
## industrial_sig 0.05587300 -0.26867581
## industrial_sig_dis -0.15335016 0.23961425
## residential_sig 0.12014661 -0.10190999
## residential_sig_dis -0.25040796 0.03631134
## retail_sig 0.10677570 -0.11630429
## retail_sig_dis -0.23770606 0.03603264
## count_un 0.05246863 0.02625072
fviz_eig(data.pca, addlabels = TRUE) + ggtitle("Eigenvalues from Principal Component Analysis") + plotTheme()n_colors <- 100
mako_gradient <- viridis(n_colors, option = "mako")
fviz_pca_var(data.pca, col.var = "cos2",
gradient.cols = mako_gradient,
repel = TRUE) + plotTheme()Our model goal is to predict the relative likelihood of litter accumulation in a given area based on the independent variables. The model will be built using the Random Forest and Linear Regression algorithms. The Random Forest model is a machine learning algorithm that uses an ensemble of decision trees to predict the likelihood of an event occurring, the event being the presence of trash. The Linear Regression model is a statistical model that uses a linear relationship between the dependent and independent variables to predict the continuous value of litter being present. We take both of these models/results and compare them to determine a mixed-use approach that factors both results with equal weight.
The following random forest model utilizes the ‘TidyModels’ package. The data is then split to a 75-25 train-test set. We specify the randomforest to function as a ‘regression’ to specify the type of output to be evaluated, that being the ‘count’ of litter in each cell. The data is then divided into 10 folds for cross-validation to measure accuracy among each iteration. The model is then tuned using a grid search to find the optimal hyperparameters at each level of different tree (decision) size, ranging between 500 - 2000 decisions. The best-performing model iteration is selected based on the Root-mean-square deviation metric, measuring the accuracy of the forecasted results of each decision. The mean error depicted in the table by each forest (iteration) lies around 0.844 - 0.847. The first result, Preprocessor1_Model10, has the best result with the lowest RMSE of 0.844. This model will be used for our final mixed-used approach.
set.seed(123)
net_tt_nor <- st_drop_geometry(net_total_li) %>%
dplyr::select(-c(uniqueID,cvID,city,country))
net_tt_nor$city <- net_total_li$city
net_tt_nor$uniqueID <- net_total_li$uniqueID
net_tt_temp <- net_tt_nor%>%dplyr::select(!c(uniqueID,city))
net_tt_split <- initial_split(net_tt_temp, prop = 0.75)
train_data <- training(net_tt_split)
test_data <- testing(net_tt_split)
rf_model <- rand_forest(
mode = "regression",
trees = tune(),
min_n = tune(),
mtry = tune(),
) %>% set_engine("ranger")
rf_recipe <- recipe(count ~ ., data = train_data)
cv_folds <- vfold_cv(train_data, v = 10)
rf_grid <- rf_grid <- grid_regular(
trees(range = c(500, 2000)),
min_n(range = c(9, 15)),
mtry(range = c(6, 30)),
levels = c(4, 3, 5)
)
rf_workflow <- workflow() %>%
add_model(rf_model) %>%
add_recipe(rf_recipe)
rf_results <- tune_grid(
rf_workflow,
resamples = cv_folds,
grid = rf_grid
)
show_best(rf_results, metric = "rmse")## # A tibble: 5 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 6 1000 15 rmse standard 0.844 10 0.0862 Preprocessor1_Model10
## 2 6 1500 15 rmse standard 0.845 10 0.0853 Preprocessor1_Model11
## 3 6 500 12 rmse standard 0.846 10 0.0866 Preprocessor1_Model05
## 4 6 1500 9 rmse standard 0.847 10 0.0858 Preprocessor1_Model03
## 5 6 2000 15 rmse standard 0.847 10 0.0852 Preprocessor1_Model12
final_rf <- finalize_workflow(
rf_workflow,
select_best(rf_results, metric = "rmse")
)
final_rf <- last_fit(final_rf, split = net_tt_split )
#results <- collect_metrics(final_rf)
#print(results)
temp_ts <- predict(final_rf$.workflow[[1]], test_data) %>%
rename(Prediction = .pred)
ts_bind <- cbind(test_data,temp_ts)
#best_results <- select_best(rf_results, metric = "rmse")
temp <- predict(final_rf$.workflow[[1]], net_total) %>%
rename(Prediction = .pred)
temp_cat <- risk_level(temp,'kmeans')
df_rf_rst <- cbind(net_total,temp_cat)The linear regression model, also built from ‘TidyModels’, takes a similar approach to predicting the count of litter in each cell. The model is trained using the same train-test split as the Random Forest using the ‘lm’ method, known as a linear method. Next, it is trained using the ‘quasi’ family to account for the overdispersion of the data. Then cross-validated using 60 folds to measure the accuracy. The results are then shown in the map below in terms of a risk-level category. Level 1 being the lowest, assuming little-no litter risk, and 5 being the highest, our newly ‘selected’ site areas to direct zero-waste sites on. The map depicting Chennai, India shows a scatter of category-5 assessment areas depicting the far north and south of the city to be the primary areas of focus. These areas are also not where our litter data was recorded which shows the model’s ability to predict areas of high risk without prior data, exactly as intended.
set.seed(223)
train_control <- trainControl(method = "cv", number = 60)
# Train the model
model <- train(count ~ ., data = train_data, method = "lm",family = "quasi",trControl = train_control)
test_temp <- predict(model,net_total)
test_cbind <- cbind(net_total,test_temp)%>%
rename(Prediction = test_temp)
temp_risk_cat <- st_drop_geometry(risk_level(test_cbind,'kmeans')) %>%dplyr::select(Risk_Category)
test_cbind <- cbind(test_cbind,temp_risk_cat)
city_viz('Chennai',test_cbind,'Linear Regression Model')Both Random Forest regressions and Linear Regressions have their benefits and drawbacks. The Random Forest model is known for its ability to handle large datasets and high dimensionality, but can be prone to overfitting and cannot handle ‘data noise’ as well as a Linear model. Random Forest models also tend to outperform linear regression in prediction accuracy when there are non-linear relationships between dependent and independent variables. We did not assume a linear or non-linear relationship between litter and the independent variables which also emphasizes the multi-modal approach. The Linear Regression model is known for its simplicity, but can be prone to underfitting and may not perform well if there is non-linear relationships among the variables. To evaluate the performance of both models, we compare the residuals by plotting them on a histogram to visualize the distribution and a scatter plot for the relationship.
The RandomForest scatterplot shows a weak relationship between the actual and predicted values, with an underestimation of the actual when the value exceed 2. These results tell us the model can indicate lower-risk areas better than higher-risk areas but have yet to identify a correlation between the independent and dependent variables. Meanwhile the Linear Regression residuals have a similar result but has slightly more curvature in the line of best fit (shown in green) which indicates mildly better performance when predicting places with higher actual values.
The histograms explain the same story differently. The Linear Model has residuals that over and underpredict the actual values, while the Random Forest consistently underpredicts the actual values. Now these results are where the mixed-model approach is incorporated. We must note that the data itself is already bias and skewed because of the inherent nature of litter, and the true verification of the model’s integrity is the real-world comparison to where it predicts.
# rf model result -- ts_bind
rf_test <- ts_bind %>% dplyr::select(count,Prediction)
lr_rst <- predict(model,test_data)
lr_test <- cbind(test_data,lr_rst)%>% rename(Prediction = lr_rst) %>% dplyr::select(count,Prediction)
rf_test$Residuals <- rf_test$count - rf_test$Prediction
lr_test$Residuals <- lr_test$count - lr_test$Prediction
ggplot(rf_test, aes(x = count, y = Prediction)) +
geom_point() + # Add points
geom_segment(aes(xend = count, yend = Prediction, x = count, y = count),
linetype = "dotted", color = "red") + # Residual lines
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "blue") +
labs(x = "Actual Value", y = "Predicted Value", title = "RandomForest Model Residuals") +
plotTheme()ggplot(lr_test, aes(x = count, y = Prediction)) +
geom_point() + # Add points
geom_segment(aes(xend = count, yend = Prediction, x = count, y = count),
linetype = "dotted", color = "red") + # Residual lines
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "blue") +
labs(x = "Actual Value", y = "Predicted Value", title = "Linear Regression Model Residuals") +
plotTheme()ggplot(lr_test, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "grey") +
geom_vline(xintercept = 0, linetype="dashed",
color = "black") +
labs(title = "Linear Model Residuals")+
plotTheme()ggplot(rf_test, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "grey") +
geom_vline(xintercept = 0, linetype="dashed",
color = "black") +
labs(title = "Random Forest Model Residuals")+
plotTheme()The Multi-Vision Model features a better distribution than the previous two with a mild right skew still existing, similarly to the RandomForest. This indicates there is greater consistency in this model and, if tuned for accuracy as well, is our best performer in the environmental assessment of relative litter accumulation risk.
The next visual features all three models compared. Here we see the Multi-Vision (bottom) having the best bell-curve representation. The Linear Model (middle) has a less consistent, duel distribution and is not as ideal as the Multi-Vision. The Random Forest (top) has a more consistent distribution but is not as accurate as the Multi-Vision because of the greater skew. The Multi-Vision model is the best performer.
Now we examine the scatterplot of multi-vision residuals. It still shows an identical relationship as the previous two, which may not be as desired, but also does not tell the full story. The model is not perfect, but it is the best performer in the context of the data and the model’s purpose. The meaning of the model is better explained by the Q-Q plot. The Q-Q plot shows the residuals of the model in comparison to a normal distribution of the actual dataset. The closer the points are to the 45-degree line, the more similar the dataset is to the theoretical distribution, indicating greater consistency. The data points are not perfectly aligned but are very close for the most part, creating a mild curvature in the middle, indicating skewness in the data. However, the tail-end of the points of higher theoretical quantities far divert from the line of best fit. These outliers deviate at an exponential rate but only after 1.5 theoretical quantiles, with a more systemic deviation between 1 and 1.5 quantiles. Overall, the results indicate that, despite the imperfections implied with the bias litter data, the model performs well relative to the circumstances and capabilities.
mixed_rst <- cbind(lr_test%>%dplyr::select(-Residuals),rf_test$Prediction)%>%
rename(lr_pred = Prediction,
rf_pred = 'rf_test$Prediction') %>%
mutate(Prediction = 0.5*rf_pred + 0.5*lr_pred,
Residuals = count - Prediction)
ggplot(mixed_rst, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "grey")+
geom_vline(xintercept = 0, linetype="dashed",
color = "black") +
ggtitle("Histogram of Multi-Vision Residuals") + # Add a title
xlab("Values") + # Label for the x-axis
ylab("Frequency") + # Label for the y-axis
plotTheme()grid.arrange(
ggplot(mixed_rst, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "grey") + plotTheme() ,
ggplot(lr_test, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "grey") + plotTheme() ,
ggplot(rf_test, aes(x = Residuals)) +
geom_histogram(bins = 30, fill = "grey") + plotTheme() ,nrow = 3)ggplot(mixed_rst, aes(x = count, y = Prediction)) +
geom_point() + # Add points
geom_segment(aes(xend = count, yend = Prediction, x = count, y = count),
linetype = "dotted", color = "red") + # Residual lines
geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "blue") +
labs(x = "Actual Value", y = "Predicted Value", title = 'Multi-Vision Residuals') +
plotTheme()ggplot(mixed_rst, aes(sample = Residuals)) +
stat_qq() +
stat_qq_line(colour = "red") +
ggtitle("Normal Q-Q Plot") +
xlab("Theoretical Quantiles") +
ylab("Sample Quantiles") +
plotTheme()The Random Forrest and Linear models are depicted side-by-side. When creating our final results, as mentioned, both are given equal weight input. We are looking for where the two models coincide and contrast in their predictions for litter assessment. These two results then provide an area where to dedicate zero-waste site resources. The final map, depicting into the 5 categories, narrows down where to focus resources in both a small enough area to a general scope yet large enough to generalize a search area that does not attempt to depict individual cells. This balance between specificity and generality develops a definitive solution and action plan for Urban Ocean.
test_cb_rst <- st_drop_geometry(test_cbind) %>%
dplyr::select(Prediction,Risk_Category)%>%
rename(Pre_lr = Prediction,
Risk_lr = Risk_Category)
rst_total <- cbind(df_rf_rst,test_cb_rst)%>%
rename(Pre_rf = Prediction,
Risk_rf = Risk_Category)
rst_total <- rst_total %>%
mutate(Prediction = 0.4*Pre_rf + 0.6*Pre_lr)
rst_total <- risk_level(rst_total,'kmeans')
grid.arrange(
city_viz('Chennai',test_cbind,'Linear Regression Model'),
city_viz('Chennai',df_rf_rst,'Random Forest Model'), nrow = 1)As Chennai, India being our example city for our model, the purpose of the demonstration is to also prove replicability among other cities. As a result, we have applied the model to 11 other cities around the world. The results are shown below. Each city has its own unique characteristics and features that are evaluated with the same variables. The results are shown in the 5-category risk levels and are consistent with the Chennai example. Each city has its own special results that cater to the local context. This newly found insight shall narrow the area of focus for Urban Ocean’s zero-waste sites, empower the local community, and provide a more sustainable future.
risk_v_new <-function(model_data,litter_data,model,city){
model_data$Risk_Category <- as.factor(model_data$Risk_Category)
ggplot() +
geom_sf(data = model_data, aes(fill = Risk_Category), colour = NA) +
geom_sf(data = litter_data, size = .3, colour = "red") +
scale_fill_manual(values = risk_palette, guide = "none") +
labs(title=model) +
mapTheme(title_size = 8)
}
city_viz_new <- function(cities,data,model){
temp <- data %>% filter(city == cities)
risk_v_new(temp,get(cities),model,cities)
}
risk_Chennai <- city_viz_new('Chennai', rst_total, 'Chennai, India')
risk_Bangkok <- city_viz_new('Bangkok', rst_total, 'Bangkok, Thailand')
risk_Can_Tho <- city_viz_new('Can_Tho', rst_total, 'Can Tho, Vietnam')
risk_Melaka <- city_viz_new('Melaka', rst_total, 'Melaka, Malaysia')
risk_Mumbai <- city_viz_new('Mumbai', rst_total, 'Mumbai, India')
risk_Panama_City <- city_viz_new('Panama_City', rst_total, 'Panama City, Panama')
risk_Pune <- city_viz_new('Pune', rst_total, 'Pune')
risk_Salvador <- city_viz_new('Salvador', rst_total, 'Salvador, Brazil')
risk_Santa_Fe <- city_viz_new('Santa_Fe', rst_total, 'Santa Fe, Argentina')
risk_Santiago <- city_viz_new('Santiago', rst_total, 'Santiago, Chile')
risk_Semarang <- city_viz_new('Semarang', rst_total, 'Semarang, Indonesia')
risk_Surat <- city_viz_new('Surat', rst_total, 'Surat, India')
grid.arrange(risk_Chennai, risk_Bangkok, risk_Can_Tho, risk_Melaka, risk_Mumbai,
risk_Panama_City, nrow = 2)grid.arrange(risk_Pune, risk_Salvador, risk_Santa_Fe, risk_Santiago, risk_Semarang, risk_Surat,
nrow = 2)